On the equivalence between marker effect models and breeding value models and direct genomic values with the Algorithm for Proven and Young

Background Single-step genomic predictions obtained from a breeding value model require calculating the inverse of the genomic relationship matrix \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$({\mathbf{G}}^{-1})$$\end{document}(G-1). The Algorithm for Proven and Young (APY) creates a sparse representation of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathbf{G}}^{-1}$$\end{document}G-1 with a low computational cost. APY consists of selecting a group of core animals and expressing the breeding values of the remaining animals as a linear combination of those from the core animals plus an error term. The objectives of this study were to: (1) extend APY to marker effects models; (2) derive equations for marker effect estimates when APY is used for breeding value models, and (3) show the implication of selecting a specific group of core animals in terms of a marker effects model. Results We derived a family of marker effects models called APY-SNP-BLUP. It differs from the classic marker effects model in that the row space of the genotype matrix is reduced and an error term is fitted for non-core animals. We derived formulas for marker effect estimates that take this error term in account. The prediction error variance (PEV) of the marker effect estimates depends on the PEV for core animals but not directly on the PEV of the non-core animals. We extended the APY-SNP-BLUP to include a residual polygenic effect and accommodate non-genotyped animals. We show that selecting a specific group of core animals is equivalent to select a subspace of the row space of the genotype matrix. As the number of core animals increases, subspaces corresponding to different sets of core animals tend to overlap, showing that random selection of core animals is algebraically justified. Conclusions The APY-(ss)GBLUP models can be expressed in terms of marker effect models. When the number of core animals is equal to the rank of the genotype matrix, APY-SNP-BLUP is identical to the classic marker effects model. If the number of core animals is less than the rank of the genotype matrix, genotypes for non-core animals are imputed as a linear combination of the genotypes of the core animals. For estimating SNP effects, only relationships and estimated breeding values for core animals are needed. Supplementary Information The online version contains supplementary material available at 10.1186/s12711-022-00741-7.

equal to the number of genotyped animals n gt . On the other hand, marker effects models fit a random effect, with the number of levels equal to the number of markers (m) . To obtain predictions for the first type of models using Henderson's mixed model equations (MME), the inverse of the genomic relationship matrix (G) is needed [3]. However, direct inversion of G when n gt is large is very expensive or even not feasible-for instance, genomic evaluation data for US dairy cattle data contain millions of genotyped animals. In contrast, marker effect models do not suffer from this constraint because the size of the block of equations concerning marker effects remains constant. Besides this advantage, marker effect models require multiplications of dense matrices and the condition number (i.e. the quotient between the largest and smallest eigenvalue of the MME) of the system is larger than for the breeding value models [4]. To handle these issues, complex solvers [5] and refined convergence criteria [6] are needed.
Increasing numbers of animals are genotyped every year but the number of markers ( m ) remains constant at around 50K [7], and therefore the rank of G is (at most) 50K. To make the breeding value models feasible with large n gt , several strategies have been proposed. Misztal et al. [8] and Misztal [9] developed the Algorithm for Proven and Young (APY), which uses a sparse representation of G −1 . The APY consists of selecting a group of genotyped animals, known as core animals, which are selected to span roughly 99% of the eigenvalue spectra of G , and then expressing the breeding values of the remaining genotyped animals, known as non-core animals, as a linear function of the breeding values of the core animals plus an error term. Mäntysaari et al. [10] proposed the GT best linear unbiased predictor (GTBLUP) model, which uses the Woodbury Identity [11] to obtain the inverse of G plus a regularization matrix without explicitly computing G −1 . The GTBLUP model requires more operations than APY when the number of core animals is less than m [10]. Fernando et al. [12] reviewed and developed different alternatives to the APY. In their study, the most practical implementation (Strategy IV) consisted of fitting a model with a design matrix that results from orthonormalization of the rows of the genotype matrix, and then obtaining the estimated breeding values as the product between the design matrix and the vector of solutions. APY, GTBLUP, and Strategy IV were compared in several scenarios and gave similar predictions when the spectrum of G was well covered by core animals [10,13].
The lack of clear-cut, deterministic selection criteria for core animals in APY has been criticized [10,12]; however, there are empirical studies on which [14] and how many [15] animals to include in the core group. However, the implications of using APY in terms of indirect predictions (direct genomic values) or single nucleotide polymorphism (SNP) effects are also unclear.
Although there are empirical results showing the reliable behavior of APY for prediction, an analytical framework to justify and examine its properties is lacking. Thus, the objective of this study was to derive a marker effects model that is equivalent to the APY. Departing from such a model, the secondary objectives of this study were to derive appropriate formulae for SNP and indirect predictions for genotyped animals without progeny and records, and to show the theoretical implications of selecting a specific group of core animals in APY.

Theory
′ be the vector of breeding values, where u c represents the sub-vector of breeding values for the core animals and u n is the sub-vector of breeding values for the non-core animals. Hereafter, the sub or superscript c will denote an object belonging to the core animals, whereas the sub or superscript n will indicate that an object belongs to the non-core animals. For a conventional GBLUP model [1], the covariance matrix of u , assuming a genetic variance equal to unity, is equal to: where k is a scaling factor based on the level of heterozygosity (for instance: 1/2 p i q i ) or on the number of markers, and Z = Z c Z n is the matrix of genotypes for all the genotyped animals. Here it is assumed that Z was derived after genotype quality control [16], centering [2], and/or additional scaling [17]. Furthermore, it is assumed that the number of core animals is smaller than or equal to the number of markers and that the core animals are chosen such that G cc is non-singular. The APY approach is based on the following recursion [9]: where P nc = G nc G −1 cc , and ξ is an error term that is assumed to be independent of u c . Taking the variance of both sides in Eq. (2), with their respective inverses, leads to: where Var(ξ) = M nn = G nn − G nc G −1 cc G cn . In practice, and hereafter, M nn is assumed to be a diagonal matrix, that is, M nn = diag G nn − G nc G −1 cc G cn [9]. This implies that, conditional on core animals, the breeding values of non-core animals are conditionally independent, which is more explicitly shown in [8], and is a variant of the socalled "approximate kernel methods" [18,19]. It has to be noted that, even when G is full rank and invertible, G −1

APY
is not equal to G −1 .

APY-GBLUP
Based on the matrices from Eq. (3), the APY-GBLUP model is defined as: where y is the vector of phenotypes, b is the vector of fixed effects, e is the vector of error terms, X and W are design matrices, and σ 2 u and σ 2 e are the genetic and residual variances, respectively. Hereafter, it will be assumed that E[y] = Xb . Assuming multivariate normality for u and e , the MME for the APY-GBLUP model are: Holding its expectation and covariance structure, the model of Eq. (4) can be partitioned into core and non-core animals as follows: resulting in the following MME: where the superscripts in G APY denote the blocks of G −1 APY corresponding to a specific combination of core and noncore animals. Substituting u n = P nc u c + ξ in Eq. (6) leads to the following model: and the following MME: Then, the BLUP of u n is obtained as u n = P nc u c + ξ . These MME are similar to those shown in Eq. (13) in [12] but differ in the error term ξ.

An equivalent model: APY-SNP-BLUP
Assuming that the genetic value of the core animals is fully explained by the markers, then: where a is the vector of marker effects, and that is, P is the perpendicular projection operator [11] to C Z ′ c (i.e., the vector space spanned by the genotypes of the core animals). Assuming that kσ 2 u is the variance of marker effects, i.e. Var(a) = Ikσ 2 u , Var(ξ) = M nn σ 2 u , and cov(a, ξ) = 0 . Then: Assuming multivariate normality for a , ξ , and e , the MME for the APY-SNP-BLUP model are: If rank(Z c ) = rank(Z) , which is true when the number of core animals is equal to the number of markers and given a non-singular G cc , P = I . Further, M nn = 0 , and since ξ has null expectation, ξ = 0 . Consequently, for a sufficiently large number of core animals: u c = Z c a and Then, the model in Eq. (12) reduces to: and the MME are: which are the MME for a typical SNP-BLUP model [1]. Thus, APY-SNP-BLUP converges to the regular SNP-BLUP when the number of core animals increases.

Distribution of breeding values and marker effects with APY
By defining the breeding values as in Eq. (10) and keeping the distributional assumptions of a and ξ from Eq. (12), (11) Var(u n ) = Z n PVar(a)PZ ′ n + Var(ξ) the covariance matrix of the joint distribution of u and a is: Then, assuming normality for a and ξ , the conditional distribution of u on a is: Note that this is a degenerate normal distribution because its covariance matrix is non-positive definite. In the classical GBLUP and SNP-BLUP models [1], the variance of u conditional on a is zero (i.e., if marker effects are known, the breeding value is fully explained), whereas Eq. (18) shows that in APY, the variance of u conditional on a is nonzero for the non-core animals.
Conversely, the conditional distribution of a on u is: The BLUP of a given u = u can be obtained from Eq. (19) as [1]: which after algebra reduces to [20]: with variance equal to: Thus, in order to obtain predictions of marker effects from the APY-GBLUP model, only the estimated breeding values of the core animals and the corresponding blocks of matrices are needed.

Indirect predictions with APY
Indirect predictions u ip are estimated breeding values for animals without own records or progeny with records, based on estimated marker effects from the genetic evaluation. Therefore, an indirect-predicted animal is by definition a non-core animal. Using Eq. (18), indirect predictions with APY are calculated as: where Z ip is the centered and scaled genotype matrix for the indirect-predicted animals, and G ip,c is the genomic relationship matrix between indirect-predicted and core animals. Note that the genotype matrix Z ip has to be centered and scaled in exactly the same manner as the matrices Z = Z c Z n that are used for genomic evaluation by any method; otherwise, the algebra (i.e. Eq. (17)) used to derive the joint distribution of breeding values and markers is an approximation. Changing the centering of Z ip (i.e., using a different set of allele frequencies) results in a shift of the mean and, in addition, a different implicit ξ in the APY approximation. Changing the scale of Z ip (i.e., multiplying by a different k ) changes the scale of marker effects and of the indirect predictions.
Note that the leftmost expression in Eq. (23) is, as expected, the selection index formulation for estimation of breeding values. Two measures of uncertainty are associated with estimation of u ip . First, the uncertainty associated with the variance in Eq. (18) that arises from the error term in Eq. (10), i.e. due to the approximation in APY. This uncertainty is calculated similarly to a reliability: where g ii and m ii are the diagonal element of G and the element of M nn , respectively, corresponding to the i th animal. Equation (24) converges to one when ξ → 0 , which occurs when the size of the core increases. The second measure of uncertainty is the usual reliability associated with the prediction error variance of u ip | a , which is: where g i,c is the block of G that relates the i th individual with the core animals, and g c,i = g i,c ′ . The reliability associated with Eq. (25) is:

APY-GBLUP and APY-SNP-BLUP models with a residual polygenic effect
To consider an extra polygenic effect based on pedigree, G −1 APY in Eq. (5) can be calculated based on the following G * : where A 22 is the block of the numerator relationship matrix corresponding to the genotyped animals, L is the Cholesky factor or its approximation of A 22 [21], and β is the proportion of the residual polygenic effect. When the matrices from model Eq. (4) are constructed based on Eq. (27), the resulting model is designated as APY-GBLUP with a residual polygenic effect. In this case, the breeding values are not fully explained by the markers. Therefore: where ε ∼ MVN 0, βA cc σ 2 u is the vector of residual polygenic effects [22], A cc is the block of the numerator relationship matrix corresponding to the core animals, . Then, a marker effects model equivalent to APY-GBLUP with a residual polygenic effect is: The MME for the model in Eq. (29) are: where δ =

APY single-step GBLUP (APY-ssGBLUP) and APY single-step SNP-BLUP (APY-ssSNP-BLUP)
For a general ssGBLUP model [23] with APY [14], assuming a genetic variance equal to 1, the covariance matrix of the breeding values is equal to: where the subscripts 1 and 2 denote the non-genotyped and genotyped animals, respectively. Letting  , the inverse of H APY is equal to [24]: Then, the APY-ssGBLUP model is defined as: The MME for the model in Eq. (33) are: Following [25], the breeding values of non-genotyped animals can be written as: u 1 = A 12 A −1 22 u 2 + η , where η ∼ MVN 0, A 11 −1 and represents the imputation error [23,25]. Then, the breeding values of all animals can be expressed as: (31) Var , an equivalent model to that presented in Eq. (33) is: The MME corresponding to this model are: (27), the resulting model is called APY-ssGBLUP with a residual polygenic effect. In such a case, Eq. (35) is modified to: with the following MME:

Computational complexity of the APY-based models
In terms of pre-processing steps, i.e., preparation of the required matrices to set up the MME, the APY-GBLUP model has a cubic cost in the number of core animals, and a linear cost in both the number of markers and the number of non-core animals. In contrast, the APY-SNP-BLUP model has a quadratic cost in both the number of core animals and the number of markers, and a linear cost in the number of non-core animals. Per iteration, the computational cost of APY-GBLUP is quadratic in the number of core and linear in the number of noncore animals, while APY-SNP-BLUP is quadratic in the number of markers and linear in the number of non-core animals. Thus, as long as the number of core animals is smaller than the number of markers, APY-GBLUP will be computationally less demanding than APY-SNP-BLUP. If a residual polygenic effect is assumed, the choice of the core animals plays a major role in computational efficiency because of the computation of A −1 cc . For example, if unrelated core animals are chosen, then computations that involve A −1 cc are trivial. Regardless, the APY-SNP-BLUP model may not be of great practical interest, as it can be replaced by either GBLUP or SNP-BLUP. However, the APY-SNP-BLUP model is useful to derive analytical properties of the APY algorithm.
Under single-step models, computational requirements for genetic evaluation of genotyped animals depend on how A −1 22 is included in the MME for both breeding value and marker-based models. That is, A −1 22 is included either as the product of A −1 22 times a vector for breeding value models [26] or as the solution of the sparse system A 11 Z † = −A 12 Z † for marker based models [25], where Z † is the matrix of imputed genotypes.

Discussion
In this study, we derived a marker effects model that is equivalent to the APY-GBLUP model. On the one hand, when the number of core animals is equal to the rank of the genotype matrix, G APY is singular, as noted by [12]. This can be interpreted as core animals covering all the genetic variation in the population that is captured by the genotyped markers. In that case, APY-GBLUP is equivalent to a regular GBLUP or SNP-BLUP, which has been analytically proven in this study (for further information, a numerical illustration is provided in Additional file 1). On the other hand, when the number of core animals is lower than the rank of the genotype matrix, G APY is non-singular and a marker effects model, named APY-SNP-BLUP, can be constructed that is equivalent to APY-GBLUP. The APY-SNP-BLUP differs from the regular SNP-BLUP model in the following manner: (i) it has a reduction in the row and column spaces of Z (from its replacement with Z † ), and (ii) it has an additional error term (ξ) for non-core animals. The former indicates that the number of possible genotypes and haplotypes that can be formed by a linear combination of the rows or columns of the genotype matrix is reduced. The degree of reduction of both spaces is controlled by the number of core animals. With a fixed number of core animals, selecting different sets of core animals is equivalent to selecting different subspaces of the row and column spaces of Z . When the number of core animals is such that a large portion of the spectrum of G (or the set of singular values of Z ) is covered, those sub-spaces that correspond to different sets of core animals will increasingly overlap, which means that their intersection will not be equal to the null vector space. In that case, many genotypes in the population can be generated as linear combinations of the rows of Z c . If a certain genotype, say z i , cannot be formed, the distance between z i and its projection to the row space of Z c (d(z i − Pz i )) is in general expected to be small. The particular choice of core animals is not important, as long as it covers the spectrum of G . Although the heuristics of how core animals should be chosen needs to be refined, a wide random choice appears to be adequate [10,13,27,28]. However, as the population evolves and new haplotype combinations are created, the chosen core may become less representative, although this is expected to be a slow process. Methods to choose and update a set of core animals that spans most variability in G, e.g., [29], is an open area of research.
The large changes in estimated breeding values for some non-core animals when the core animals are changed, while keeping its number constant, as reported by [30], can be explained by a large distance between the projected and the real genotype of those individuals. This is the case, for instance, when a mislabeled animal is evaluated within one breed but actually belongs to another breed.
Note also that core animals do not need to be individuals with, a priori, high reliability. Marker effects are back-solved from core animals, but these animals gather information from the entire population through the genomic relationships. Thus, the accuracy of the estimated breeding values of these core animals will be very high. For instance, in dairy cattle, core animals could be based on cows. If these cows have a strong relationship with the whole population, they are very accurately estimated, and so will be the SNP effects and indirect predictions.
We also derived the distribution of breeding values conditional on the marker effects and vice versa when using APY. For the first case, breeding values of the noncore animals are only partially explained by the marker effects because of the error term ξ in Eq. (10). For the second case, we showed that the BLUP of the marker effects conditional on the breeding values of animals, only requires the matrices and estimated breeding values corresponding to the core animals-this result was not known before. Although non-core animals do not appear in the explicit calculation of the marker effects, their information is used to estimate breeding values of the core animals. When Eq. (21) is not used to obtain estimates for the marker effects from an APY-GBLUP model, those estimates will not have minimum variance.
In genetic evaluations, indirect predictions for animals without own records or progeny with records can be calculated from estimates of marker effects that are obtained by back-solving from the estimated breeding values [20]. If an APY-(ss)GBLUP model is used for genetic evaluations then the proper way to calculate indirect predictions is based on the distribution in Eq. (18). For core animals, the formula to calculate indirect predictions is equal to the Eq. (10) in [20]. However, for genetic evaluation, animals for which indirect predictions are calculated are, by definition, non-core animals.
Therefore, in that case, their indirect predictions must be obtained from Eq. (23).
Two measures of uncertainty associated with indirect predictions were derived. The first measure of uncertainty, Eq. (24), quantifies how different the indirect prediction would be from that calculated with a regular GBLUP or SNP-BLUP model. If the number of core animals is large enough, this measure will be close to 1. A value of 1 indicates that the individual is in the space of genetic variation described by the core animals and indirect predictions based on APY and SNP-BLUP are identical. The second measure of uncertainty, based on Eq. (26), is a classical reliability of estimated breeding value and is a function of the prediction error variance of the indirect prediction in Eq. (25). This reliability can be expressed in terms of prediction error variances of marker effects or in terms of prediction error variances of core animals. The latter results in higher computational efficiency because the number of core animals is smaller than the number of markers.
In the same way that an equivalent SNP-BLUP model exists for the APY-BLUP model, we showed that when genotyped and non-genotyped animals are combined in the evaluation using a single-step approach, there is an APY-ssSNP-BLUP model that is equivalent to the APY-ssGBLUP model. In the APY-ssGBLUP model, breeding values are jointly estimated for non-genotyped, core, and non-core animals, while the APY-ssSNP-BLUP model estimates SNP effects based on the core animals, an error term for non-core animals, and the genotype imputation error for non-genotyped animals. Estimates of breeding values for all the animals can then be obtained by a linear combination of the corresponding design matrices and the vector of solutions. As in Fernando et al. [31], estimated breeding values for nongenotyped animals can be obtained directly to improve computational efficiency. Adding the polygenic effect does not change the MME for APY-ssGBLUP but adds an extra term ( ε ) to the MME for APY-ssSNP-BLUP. In that case, the number of unknowns is equal to that in the model presented by [22]. Therefore, APY-ssSNP-BLUP is more complex and involves more convoluted matrix multiplications. Whenever the MME are augmented, there is a question on the feasibility and convergence of the model with real datasets. Vandenplas et al. [5] proposed a second-level preconditioner to ease convergence problems in ssSNP-BLUP, and Vandenplas et al. [6] presented a different termination criterion to determine convergence of such models. Although APY-ssSNP-BLUP is more flexible regarding the use of different priors for marker effects [25], its convergence will need to be monitored carefully.